Predicting Belgian Real Estate Prices: Part 2: Data Cleanup After Web Scraping

Predicting Belgian Real Estate Prices
Author

Adam Cseresznye

Published

November 4, 2023

Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash

In part 1, we provided a brief introduction to the project’s purpose. Now, in part 2, we will dive deeper into the data processing steps required after scraping. We will discuss the handling of numerical data, categorical variables, and boolean values. Additionally, we’ll assess the data quality by examining the error log generated by the Immowebscraper class. Let’s get to it!

Note

You can access the project’s app through its Streamlit website.

Import data

Code
import time
from pathlib import Path

import numpy as np
import pandas as pd
from data import pre_process, utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

Data Pre-cleaning steps

In the upcoming section, we will delve into fundamental post-web scraping procedures. While scraping, we acquired 50 features, making our dataset information-rich. However, there’s significant work ahead due to the data’s lack of cleanliness. This entails tasks such as dtype conversion, column parsing to extract numerical values, and the transformation of Boolean values into binary variables.
Here is the dataset we have gathered through web scraping:

Code
for filename in utils.Configuration.RAW_DATA_PATH.glob("*.gzip"):
    if "for_NB2" in filename.stem:
        df = pd.read_parquet(filename)
print(df.shape)
df.head().style.set_sticky(axis=0)
(3906, 50)
  Available as of Construction year Building condition Street frontage width Number of frontages Covered parking spaces Outdoor parking spaces Surroundings type Living area Living room surface Dining room Kitchen type Kitchen surface Bedrooms Bedroom 1 surface Bedroom 2 surface Bedroom 3 surface Bathrooms Toilets Office Basement Furnished Surface of the plot Connection to sewer network Gas, water & electricity Garden surface TV cable Primary energy consumption Energy class Reference number of the EPC report CO₂ emission Yearly theoretical total energy consumption Heating type Double glazing Price Cadastral income Tenement building Address Website External reference day_of_retrieval ad_url As built plan Latest land use designation Proceedings for breach of planning regulations Possible priority purchase right Subdivision permit Flood zone type Planning permission obtained Width of the lot on the street
0 After signing the deed 1975 Good 23 m 4 2 2 Living area (residential, urban or rural) 420 m² square meters 63 m² square meters Yes USA hyper equipped 21 m² square meters 5 28 m² square meters 24 m² square meters 21 m² square meters 4 5 Yes Yes No 4677 m² square meters Not connected Yes 4300 m² square meters Yes 296 kWh/m² kilowatt hour per square meters D 20210910012683 74 kg CO₂/m² 133062 kWh/year Fuel oil Yes € 1,225,000 1225000 € € 7,615 7615 € No drève richelle 96 1410 - Waterloo http://www.bytheway.be 5530019 2023-09-27 14:02:10.370650 https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811 None None None None None None None None
1 After signing the deed 1850 To renovate 9 m 2 1 None Urban 113 m² square meters None Yes Semi equipped None 2 14 m² square meters 10 m² square meters None 1 1 None None No 73 m² square meters Connected None None Yes 307 kWh/m² kilowatt hour per square meters D 20230306014621 76 kg CO₂/m² 43587 kWh/year Fuel oil Yes Make an offer starting from € 89,000 Make an offer starting from 89000 € € 541 541 € No Chaussée de l'Ourthe 65 6900 - Marche-en-Famenne http://www.weinvest.be 5246367 2023-09-27 14:02:10.641758 https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929 No Living area (residential, urban or rural) No No No None None 10 m meters
2 After signing the deed 1949 To renovate 18 m 3 1 None Isolated 139 m² square meters 10 m² square meters Yes Semi equipped 13 m² square meters 2 17 m² square meters 11 m² square meters None 1 1 Yes Yes No 413 m² square meters Connected None 315 m² square meters Yes 699 kWh/m² kilowatt hour per square meters G 20230303013078 173 kg CO₂/m² 96913 kWh/year Fuel oil Yes Make an offer starting from € 150,000 Make an offer starting from 150000 € € 689 689 € No Rue de la Wallonie 2A 4680 - Oupeye http://www.nigel-immo.be 5534704 2023-09-27 14:02:10.905808 https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582 None Living area (residential, urban or rural) No No No Non flood zone None 18 m meters
3 After signing the deed 1899 Good 4.2 m 2 None None Urban 119 m² square meters 14 m² square meters None USA installed 17 m² square meters 3 14 m² square meters 14 m² square meters 10 m² square meters 2 2 None None No 55 m² square meters Connected Yes None Yes 246 kWh/m² kilowatt hour per square meters C 20230622-0002923649-RES-1 4167 kg CO₂/m² Not specified Gas Yes € 272,000 272000 € € 898 898 € No Sint-Denijslaan 1 9000 - Gent http://www.immodavinci.be 5535455 2023-09-27 14:02:11.811691 https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392 None Living area (residential, urban or rural) No Yes No Non flood zone Yes None
4 At delivery 2021 As new 7.3 m 3 None 1 Living area (residential, urban or rural) 215 m² square meters 46 m² square meters None USA hyper equipped None 3 16 m² square meters 12 m² square meters 12 m² square meters 1 2 None None No 330 m² square meters Connected Yes 250 m² square meters Yes 27 kWh/m² kilowatt hour per square meters A 12345 Not specified Not specified Gas Yes € 413,150 413150 € None No Astridlaan 1 1700 - Dilbeek http://www.living-stone.be 5527171 2023-09-27 14:02:11.968969 https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333 None Living area (residential, urban or rural) None No Yes Non flood zone Yes None

Based on the dataset, we’ve identified two primary tasks that need to be performed across multiple columns:

  1. Handling Numerical Columns: This involves extracting numerical data using regex and converting it to float format.

  2. Dealing with Binary Columns: Many columns contain binary values, such as “Yes” and “No.” We can easily convert these columns to boolean data types instead of string representations.

  3. Special Handling for Certain Columns: Some columns, like “flood_zone_type” and “connection_to_sewer_network,” also have low cardinality and should be converted to boolean values. However, their values do not align with the typical “True” and “False” boolean mapping. Instead, they require a unique dictionary mapping compared to the other boolean columns.

The pre_process_dataframe function below serves as a solid starting point. Some may suggest that breaking it into multiple subfunctions could improve maintainability and enable unit testing. However, for the time being, we’ll maintain it in its current form.

Code
def pre_process_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocesses a DataFrame by performing various data cleaning and transformation tasks.

    Args:
        df (pandas.DataFrame): The input DataFrame to be preprocessed.

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """

    def extract_numbers(df: pd.DataFrame, columns: list):
        """
        Extracts numeric values from specified columns in the DataFrame.

        Args:
            df (pandas.DataFrame): The DataFrame to extract values from.
            columns (list): List of column names to extract numeric values from.

        Returns:
            pandas.DataFrame: The DataFrame with extracted numeric values.
        """
        for column in columns:
            try:
                df[column] = df[column].str.extract(r"(\d+)").astype("float32")
            except Exception as e:
                print(f"Error processing column {column}: {e}")
        return df

    def map_values(df: pd.DataFrame, columns: list):
        """
        Maps boolean values in specified columns to True, False, or None.

        Args:
            df (pandas.DataFrame): The DataFrame to map values in.
            columns (list): List of column names with boolean values to be mapped.

        Returns:
            pandas.DataFrame: The DataFrame with mapped boolean values.
        """
        for column in columns:
            try:
                df[column] = df[column].map({"Yes": True, None: False, "No": False})
            except Exception as e:
                print(f"Error processing column {column}: {e}")
        return df

    number_columns = [
        "construction_year",
        "street_frontage_width",
        "number_of_frontages",
        "covered_parking_spaces",
        "outdoor_parking_spaces",
        "living_area",
        "living_room_surface",
        "kitchen_surface",
        "bedrooms",
        "bedroom_1_surface",
        "bedroom_2_surface",
        "bedroom_3_surface",
        "bathrooms",
        "toilets",
        "surface_of_the_plot",
        "width_of_the_lot_on_the_street",
        "garden_surface",
        "primary_energy_consumption",
        "co2_emission",
        "yearly_theoretical_total_energy_consumption",
    ]

    boolean_columns = [
        "basement",
        "furnished",
        "gas_water__electricity",
        "double_glazing",
        "planning_permission_obtained",
        "tv_cable",
        "dining_room",
        "proceedings_for_breach_of_planning_regulations",
        "subdivision_permit",
        "tenement_building",
        "possible_priority_purchase_right",
    ]

    return (
        df.sort_index(axis=1)
        .fillna(np.nan)
        .rename(
            columns=lambda column: column.lower()
            .replace(" ", "_")
            .replace("&", "")
            .replace(",", "")
        )
        .rename(columns={"co₂_emission": "co2_emission"})
        .pipe(lambda df: extract_numbers(df, number_columns))
        .pipe(lambda df: map_values(df, boolean_columns))
        .assign(
            flood_zone_type=lambda df: df.flood_zone_type.map(
                {
                    "Non flood zone": False,
                    "No": False,
                    "Possible flood zone": True,
                }
            ),
            connection_to_sewer_network=lambda df: df.connection_to_sewer_network.map(
                {
                    "Connected": True,
                    "Not connected": False,
                }
            ),
            as_built_plan=lambda df: df.as_built_plan.map(
                {
                    "Yes, conform": True,
                    "No": False,
                }
            ),
            cadastral_income=lambda df: df.cadastral_income.str.split(" ", expand=True)[
                3
            ].astype("float32"),
            price=lambda df: df.price.str.rsplit(" ", expand=True, n=2)[1].astype(
                float
            ),
        )
    )


df_pre_processed = pre_process_dataframe(df)
df_pre_processed.head().style.set_sticky(axis=0)
  address as_built_plan available_as_of basement bathrooms bedroom_1_surface bedroom_2_surface bedroom_3_surface bedrooms building_condition co2_emission cadastral_income connection_to_sewer_network construction_year covered_parking_spaces dining_room double_glazing energy_class external_reference flood_zone_type furnished garden_surface gas_water__electricity heating_type kitchen_surface kitchen_type latest_land_use_designation living_area living_room_surface number_of_frontages office outdoor_parking_spaces planning_permission_obtained possible_priority_purchase_right price primary_energy_consumption proceedings_for_breach_of_planning_regulations reference_number_of_the_epc_report street_frontage_width subdivision_permit surface_of_the_plot surroundings_type tv_cable tenement_building toilets website width_of_the_lot_on_the_street yearly_theoretical_total_energy_consumption ad_url day_of_retrieval
0 drève richelle 96 1410 - Waterloo nan After signing the deed True 4.000000 28.000000 24.000000 21.000000 5.000000 Good 74.000000 7615.000000 False 1975.000000 2.000000 True True D 5530019 nan False 4300.000000 True Fuel oil 21.000000 USA hyper equipped nan 420.000000 63.000000 4.000000 Yes 2.000000 nan nan 1225000.000000 296.000000 nan 20210910012683 23.000000 nan 4677.000000 Living area (residential, urban or rural) True False 5.000000 http://www.bytheway.be nan 133062.000000 https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811 2023-09-27 14:02:10.370650
1 Chaussée de l'Ourthe 65 6900 - Marche-en-Famenne False After signing the deed nan 1.000000 14.000000 10.000000 nan 2.000000 To renovate 76.000000 541.000000 True 1850.000000 1.000000 True True D 5246367 nan False nan nan Fuel oil nan Semi equipped Living area (residential, urban or rural) 113.000000 nan 2.000000 nan nan nan False 89000.000000 307.000000 False 20230306014621 9.000000 False 73.000000 Urban True False 1.000000 http://www.weinvest.be 10.000000 43587.000000 https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929 2023-09-27 14:02:10.641758
2 Rue de la Wallonie 2A 4680 - Oupeye nan After signing the deed True 1.000000 17.000000 11.000000 nan 2.000000 To renovate 173.000000 689.000000 True 1949.000000 1.000000 True True G 5534704 False False 315.000000 nan Fuel oil 13.000000 Semi equipped Living area (residential, urban or rural) 139.000000 10.000000 3.000000 Yes nan nan False 150000.000000 699.000000 False 20230303013078 18.000000 False 413.000000 Isolated True False 1.000000 http://www.nigel-immo.be 18.000000 96913.000000 https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582 2023-09-27 14:02:10.905808
3 Sint-Denijslaan 1 9000 - Gent nan After signing the deed nan 2.000000 14.000000 14.000000 10.000000 3.000000 Good 4167.000000 898.000000 True 1899.000000 nan nan True C 5535455 False False nan True Gas 17.000000 USA installed Living area (residential, urban or rural) 119.000000 14.000000 2.000000 nan nan True True 272000.000000 246.000000 False 20230622-0002923649-RES-1 4.000000 False 55.000000 Urban True False 2.000000 http://www.immodavinci.be nan nan https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392 2023-09-27 14:02:11.811691
4 Astridlaan 1 1700 - Dilbeek nan At delivery nan 1.000000 16.000000 12.000000 12.000000 3.000000 As new nan nan True 2021.000000 nan nan True A 5527171 False False 250.000000 True Gas nan USA hyper equipped Living area (residential, urban or rural) 215.000000 46.000000 3.000000 nan 1.000000 True False 413150.000000 27.000000 nan 12345 7.000000 True 330.000000 Living area (residential, urban or rural) True False 2.000000 http://www.living-stone.be nan nan https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333 2023-09-27 14:02:11.968969

A crucial task on our agenda is the thorough parsing of address information. This is where the separate_address function comes into play. While this function provides a solid foundation for parsing location details, we’ve found that achieving high accuracy using regex alone can be challenging. For this reason, we’ve chosen to leverage the Google Maps API to extract location details effectively.

The function that’s integrated into our final pipeline is get_location_details_from_google, which can be located in the pre_process.py module. This step is essential for extracting key details such as the city, ZIP code, house number, and stre reliably with high accuracyet. As we all know, location plays a vital role in real estate price estimation.

It’s worth mentioning that we’ve removed the original address field to eliminate redundant data and ensure our dataset is more streamlined and efficions.

Code
def separate_address(df: pd.DataFrame) -> pd.DataFrame:
    """Separates the address into city, street name, house number, and zip code.

    Args:
        df (pd.DataFrame): The DataFrame containing the address column.

    Returns:
        pd.DataFrame: The DataFrame with the address separated into different columns.
    """
    # Define a regular expression pattern to extract street, house number, and zip code
    pattern = r"(?P<street_name>.*?)\s*(?P<house_number>\d+\w*)?\s*(?P<zip>\d{4})"

    try:
        return df.assign(
            city=lambda df: df.address.str.rsplit("-", expand=True, n=1)[1].str.title(),
            **(lambda dfx: dfx.rename(columns={"address": "original_address"}))(
                df["address"].str.extract(pattern)
            ),
            street=lambda df: df.street_name.str.replace(
                r"[^a-zA-Z\s]", "", regex=True
            ),
        ).drop(columns=["street_name", "address"])
    except Exception as e:
        print(f"Error separating address: {e}")
        return df


finer_pre_cleaned = separate_address(df_pre_processed)
finer_pre_cleaned.head().style.set_sticky(axis=0)
  as_built_plan available_as_of basement bathrooms bedroom_1_surface bedroom_2_surface bedroom_3_surface bedrooms building_condition co2_emission cadastral_income connection_to_sewer_network construction_year covered_parking_spaces dining_room double_glazing energy_class external_reference flood_zone_type furnished garden_surface gas_water__electricity heating_type kitchen_surface kitchen_type latest_land_use_designation living_area living_room_surface number_of_frontages office outdoor_parking_spaces planning_permission_obtained possible_priority_purchase_right price primary_energy_consumption proceedings_for_breach_of_planning_regulations reference_number_of_the_epc_report street_frontage_width subdivision_permit surface_of_the_plot surroundings_type tv_cable tenement_building toilets website width_of_the_lot_on_the_street yearly_theoretical_total_energy_consumption ad_url day_of_retrieval city house_number zip street
0 nan After signing the deed True 4.000000 28.000000 24.000000 21.000000 5.000000 Good 74.000000 7615.000000 False 1975.000000 2.000000 True True D 5530019 nan False 4300.000000 True Fuel oil 21.000000 USA hyper equipped nan 420.000000 63.000000 4.000000 Yes 2.000000 nan nan 1225000.000000 296.000000 nan 20210910012683 23.000000 nan 4677.000000 Living area (residential, urban or rural) True False 5.000000 http://www.bytheway.be nan 133062.000000 https://www.immoweb.be/en/classified/villa/for-sale/braine-l%27alleud/1420/10844811 2023-09-27 14:02:10.370650 Waterloo 96 1410 drve richelle
1 False After signing the deed nan 1.000000 14.000000 10.000000 nan 2.000000 To renovate 76.000000 541.000000 True 1850.000000 1.000000 True True D 5246367 nan False nan nan Fuel oil nan Semi equipped Living area (residential, urban or rural) 113.000000 nan 2.000000 nan nan nan False 89000.000000 307.000000 False 20230306014621 9.000000 False 73.000000 Urban True False 1.000000 http://www.weinvest.be 10.000000 43587.000000 https://www.immoweb.be/en/classified/town-house/for-sale/jemelle/5580/10843929 2023-09-27 14:02:10.641758 Famenne 65 6900 Chausse de lOurthe
2 nan After signing the deed True 1.000000 17.000000 11.000000 nan 2.000000 To renovate 173.000000 689.000000 True 1949.000000 1.000000 True True G 5534704 False False 315.000000 nan Fuel oil 13.000000 Semi equipped Living area (residential, urban or rural) 139.000000 10.000000 3.000000 Yes nan nan False 150000.000000 699.000000 False 20230303013078 18.000000 False 413.000000 Isolated True False 1.000000 http://www.nigel-immo.be 18.000000 96913.000000 https://www.immoweb.be/en/classified/house/for-sale/esneux%20tilff/4130/10838582 2023-09-27 14:02:10.905808 Oupeye 2A 4680 Rue de la Wallonie
3 nan After signing the deed nan 2.000000 14.000000 14.000000 10.000000 3.000000 Good 4167.000000 898.000000 True 1899.000000 nan nan True C 5535455 False False nan True Gas 17.000000 USA installed Living area (residential, urban or rural) 119.000000 14.000000 2.000000 nan nan True True 272000.000000 246.000000 False 20230622-0002923649-RES-1 4.000000 False 55.000000 Urban True False 2.000000 http://www.immodavinci.be nan nan https://www.immoweb.be/en/classified/house/for-sale/gent/9000/10838392 2023-09-27 14:02:11.811691 Gent 1 9000 SintDenijslaan
4 nan At delivery nan 1.000000 16.000000 12.000000 12.000000 3.000000 As new nan nan True 2021.000000 nan nan True A 5527171 False False 250.000000 True Gas nan USA hyper equipped Living area (residential, urban or rural) 215.000000 46.000000 3.000000 nan 1.000000 True False 413150.000000 27.000000 nan 12345 7.000000 True 330.000000 Living area (residential, urban or rural) True False 2.000000 http://www.living-stone.be nan nan https://www.immoweb.be/en/classified/house/for-sale/affligem/1790/10842333 2023-09-27 14:02:11.968969 Dilbeek 1 1700 Astridlaan

Inspecting data quality

Reading in and inspecting the log file

Analyzing the error log file, we’ve identified a total of 3,515 errors encountered during the web scraping process on the Immoweb website. Let’s delve into these errors to pinpoint the most common issues and address them accordingly.

Code
error_log = pd.read_table(
    utils.Configuration.RAW_DATA_PATH.joinpath("make_dataset_error_for_NB2.log"),
    header=None,
).rename(columns={0: "error"})

error_log
error
0 2023-09-27 14:02:11 - ERROR - No tables found ...
1 2023-09-27 14:02:17 - ERROR - No tables found ...
2 2023-09-27 14:02:22 - ERROR - No tables found ...
3 2023-09-27 14:02:25 - ERROR - No tables found ...
4 2023-09-27 14:02:34 - ERROR - No tables found ...
... ...
3510 2023-09-27 16:43:45 - ERROR - Duplicate labels...
3511 2023-09-27 16:43:57 - ERROR - No tables found ...
3512 2023-09-27 16:43:57 - ERROR - Duplicate labels...
3513 2023-09-27 16:43:57 - ERROR - Duplicate labels...
3514 2023-09-27 16:44:03 - ERROR - No tables found ...

3515 rows × 1 columns

Most common errors from log file

It’s clear that a significant majority of the errors, accounting for 1,848 cases, result from the absence of tables on the pages. These errors are primarily found on listing ads and index pages. To address this issue, we’ve introduced an if clause into our method extract_ads_from_given_page, which can be found in the make_dataset.py module. The clause, if "immoweb.be" in item and "https://www.immoweb.be/en/search" not in item, enables us to filter out undesired pages that don’t contain relevant table information for our ads. This not only helps mitigate errors but also speeds up the dataset collection process by reducing the number of pages we scrape.

Another category of errors, totaling 1,460 cases, is related to the presence of duplicate labels during processing. We may need to investigate this issue further at a later stage to ensure data quality and accuracy.

A smaller proportion of errors is linked to the “Empty data” message, primarily related to ads. Finally, the remaining errors encompass errorrs related to data type conversion. We can consider either leaving these columns as is, since the error is not that frequent, or removing these features altogether.

Code
(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[0]
    .value_counts()
)
0
 No tables found while processing                                                                                                        1848
 Duplicate labels found while processing                                                                                                 1460
 Empty data while processing                                                                                                               18
 An error occurred on page 327: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 255: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
                                                                                                                                         ... 
 An error occurred on page 173: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 174: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 176: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 177: ("Expected bytes, got a 'int' object", 'Conversion failed for column Number of frontages with type          1
 An error occurred on page 331: ("Expected bytes, got a 'int' object", 'Conversion failed for column Outdoor parking spaces with type       1
Name: count, Length: 192, dtype: int64

We’ve identified five columns responsible for the last set of data type conversion problems (See below):

  1. Construction year
  2. Number of frontages
  3. Outdoor parking spaces
  4. Covered parking spaces
  5. Bedrooms
Code
(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[0]
    .value_counts()[3:]
    .to_frame()
    .reset_index()
    .rename(columns={0: "error_type"})
    .assign(
        error_type=lambda df: df.error_type.str.split(",", expand=True)[2].str.split(
            " ", n=5, expand=True
        )[5]
    )
    .error_type.value_counts()
)
error_type
Construction year with type         83
Number of frontages with type       65
Outdoor parking spaces with type     7
Covered parking spaces with type     7
Bedrooms with type                   1
Name: count, dtype: int64

Unique URLs from the error logs

Upon conducting a more comprehensive analysis of the URLs extracted from error messages, a noteworthy observation comes to light: we’ve encountered only 433 unique URLs. This suggests that the 3,515 errors are stemming from a relatively restricted set of web addresses.

Now, here’s a question that arises: the website implies the presence of 10,000 ads on the page. However, given our successful extraction of only 3,906 ads, along with the 433 URLs associated with errors, there is a substantial disparity evident.

If you have any insights or hypotheses regarding this difference, I’d be eager to hear your thoughts and discuss potential reasons for this variation.

Code
(
    error_log.error.str.split("-", expand=True)[4]
    .str.rsplit(" ", n=1, expand=True)[1]
    .unique()
    .shape
)
(433,)

Inspecting the data itself

After eliminating rows where all values were missing and filtering for rows with non-missing prices, we’ve successfully refined our dataset to include 3,660 ads.

In our subsequent analysis, we focus on the features with the lowest percentage of missing da, just like we did in Part 1ta. Notably, “day of retrieval” and “price” are complete, with all values present. However, it’s important to recognize that roughly one-third of the data related to “dining_room” and “office” is missing, highlighting the need for improving data completeness in these specific attributet.s.

Code
(
    finer_pre_cleaned.dropna(axis=0, how="all")
    .query("price.notna()")
    .notna()
    .sum()
    .sort_values()
    .div(3660)
    .mul(100)
    .round(1)
)
dining_room                                        29.3
office                                             29.5
planning_permission_obtained                       32.1
tv_cable                                           34.0
proceedings_for_breach_of_planning_regulations     36.3
subdivision_permit                                 38.2
yearly_theoretical_total_energy_consumption        39.5
width_of_the_lot_on_the_street                     42.4
co2_emission                                       43.0
connection_to_sewer_network                        44.7
possible_priority_purchase_right                   45.5
street_frontage_width                              46.5
basement                                           46.6
as_built_plan                                      47.0
latest_land_use_designation                        47.1
garden_surface                                     47.4
outdoor_parking_spaces                             48.0
furnished                                          48.4
surroundings_type                                  50.8
flood_zone_type                                    54.4
bedroom_3_surface                                  54.8
covered_parking_spaces                             54.9
kitchen_surface                                    58.2
available_as_of                                    58.3
living_room_surface                                64.7
bedroom_2_surface                                  66.6
gas_water__electricity                             67.0
bedroom_1_surface                                  68.2
construction_year                                  71.0
cadastral_income                                   78.3
kitchen_type                                       83.4
double_glazing                                     84.6
website                                            84.6
heating_type                                       87.2
external_reference                                 90.3
toilets                                            90.8
number_of_frontages                                94.6
bathrooms                                          94.9
house_number                                       94.9
primary_energy_consumption                         96.0
building_condition                                 96.0
surface_of_the_plot                                96.6
living_area                                        98.2
tenement_building                                  99.1
zip                                                99.5
city                                               99.5
bedrooms                                           99.5
street                                             99.5
reference_number_of_the_epc_report                 99.9
energy_class                                       99.9
price                                             100.0
ad_url                                            100.0
day_of_retrieval                                  100.0
dtype: float64

Our filter_out_missing_indexes function proves to be quite valuable in the post-processing of our scraped data. This function is located at the end of our pre-process chain saving our data to the INTERIM_DATA_PATH folder after we’ve completed pre-processing and removed missing values.

Code
def filter_out_missing_indexes(
    df: pd.DataFrame,
    filepath: Path = utils.Configuration.INTERIM_DATA_PATH.joinpath(
        f"{str(pd.Timestamp.now())[:10]}_Processed_dataset.parquet.gzip"
    ),
) -> pd.DataFrame:
    """
    Filter out rows with missing values in a DataFrame and save the processed dataset.

    This function filters out rows with all missing values (NaN) and retains only rows
    with non-missing values in the 'price' column. The resulting DataFrame is then saved
    in Parquet format with gzip compression.

    Args:
        df (pd.DataFrame): The input DataFrame.
        filepath (Path, optional): The path to save the processed dataset in Parquet format.
            Defaults to a timestamp-based filepath in the interim data directory.

    Returns:
        pd.DataFrame: The filtered DataFrame with missing rows removed.

    Example:
        To filter out missing rows and save the processed dataset:
        >>> data = pd.read_csv("raw_data.csv")
        >>> filtered_data = filter_out_missing_indexes(data)
        >>> print(filtered_data.head())

    Notes:
        - Rows with missing values in any column other than 'price' are removed.
        - The processed dataset is saved with gzip compression to conserve disk space.
    """
    processed_df = df.dropna(axis=0, how="all").query("price.notna()")
    processed_df.to_parquet(filepath, compression="gzip", index=False)
    return processed_df

It appears that we’ve successfully completed the data transformation phase of our scraped dataset. With the implementation of functions like filter_out_missing_indexes, alongside pre_process_dataframe and separate_address, we’ve assembled the essential tools required for preparing our dataset for the machine learning pipeline.

In Part 3, we’ll provide a fundamental overview and characterization of the cleaned scraped data. We’ll assess feature cardinality, examine distributions, and explore correlations among variables. I look forward to delving into these insights with you in the next installment. See you there!